1 Abstract



Background: The availability of high throughput methods for measurement of mRNA concentrations 
makes the reliability of conclusions drawn from the data and global quality control of samples and 
hybridization important issues. We address these issues by an information theoretic approach, applied 
to discretized expression values in replicated gene expression data. 

Results: Our approach yields a quantitative measure of two important parameter classes: First, the
probability $P(\sigma|S)$ that a gene is in the biological state $\sigma$ in a certain variety, given its observed
expression $S$ in the samples of that variety. Second, sample specific error probabilities which serve as
consistency indicators of the measured samples of each variety. The method and its limitations are
tested on gene expression data for developing murine B-cells, and a t-test is used as reference. On a
set of known genes it performs better than the t-test despite the crude discretization into only two
expression levels. The consistency indicators, i.e. the error probabilities, correlate well with variations
in the biological material and thus prove efficient.

Conclusions: The proposed method is effective in determining differential gene expression and sample
reliability in replicated microarray data. Already at two discrete expression levels in each sample, it
gives a good explanation of the data and is comparable to standard techniques. 
2 Background



A broad variety of algorithms has been developed 
and used to extract biologically relevant informa- 
tion from gene expression data. Among others 
commonly used are visual inspection [1], hierar-
chical and k-means clustering [2], self organizing
maps [3] and singular value decomposition.
These methods aim mainly at identifying predom- 
inant patterns and thus groups of "cooperating" 
genes based on the assumption that related genes 
have similar expression patterns. 

Compared to the amount of work devoted to ef- 
ficient methods to extract information from the 
data, somewhat less attention has been paid to the 
question of the reliability of the generated results. 
The ANOVA analysis [7] allows estimation, and
thus elimination, of some systematic error sources. 
Bootstrapping cluster analysis estimates the sta- 
bility of cluster assignments [HI based on artifi- 
cial data-sets generated with ANOVA coefficients. 
Some authors also considered the question of how 
well a certain oligo JO] is suited to measure the 
mRNA expression level of the related gene. 

Some work has gone towards the ambitious task of 
learning topological properties or qualitative fea-
tures of the genetic regulatory network from ex-
pression profiles, see e.g. [11]. A major limiting
factor in these attempts is the comparative sparse- 
ness of available data. It is therefore reasonable 
to consider reduced models, for example a Boolean 
representation of the gene activity. It is known 
that many biological properties, for instance stabil- 
ity and hysteresis, can be modeled by the dynamics 
of such reduced models El ■ 

In this work we investigate the possibility of reduc- 
ing complexity of gene expression data by discretiz- 
ing the expression levels. The approach we present 
enables a new way of extracting biologically rel- 
evant information from the data in the following 
way: A biological variety, i.e. a biological system 
defined by the investigator, is represented by sev- 
eral samples which are subjected to gene expression 
analysis. If gene expression levels are discretized
into $n$ values, and the variety is represented by
$m$ samples, the number of observable expression
states for a gene is limited to $n^m$. These observed
states $S$ are modeled as being derived from
a smaller number of underlying biological states
$\sigma$, through a measurement process. Rather than
making static assignments $S \to \sigma$, we calculate
conditional probabilities $P(\sigma|S)$. The number of
possible expression profiles for a gene over a set of
varieties is limited and the probability of each
expression profile is easily calculated. Since the model we
use considers both the underlying biology and the 
measurement process it also generates a measure 
of sample coherence in each biological variety. 
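
As a small illustration of the state counting described above, the following Python sketch enumerates the $n^m$ observable states of one gene and separates the states in which all samples agree from those in which they disagree. The function names are illustrative choices made here and are not part of the original work.

```python
from itertools import product


def observable_states(n_levels, m_samples):
    """Return all n_levels**m_samples tuples of discretized expression values."""
    return list(product(range(n_levels), repeat=m_samples))


def is_pure(state):
    """True if every sample reports the same discrete level."""
    return len(set(state)) == 1


# Binary discretization (n = 2) with m = 4 samples per variety.
states = observable_states(2, 4)
pure_states = [s for s in states if is_pure(s)]
mixed_states = [s for s in states if not is_pure(s)]

print(len(states), len(pure_states), len(mixed_states))  # 16 2 14
```

For the binary case with four samples this gives 16 observable states, of which two are "pure" and 14 contain disagreeing measurements.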

We demonstrate the feasibility of this approach for 
a binary discretization of gene expression. For the 
discretization step we use the absent/present
classification provided by the Affymetrix software.
The outcome of our method on a data set covering
gene expression in developing murine B-cells
is compared to the results of a standard analysis.
We show that even with the crude discretization
into only two expression levels the method is
competitive with statistical methods based on
continuous expression levels.



3 Methods 



3.1 The Model 

A major step in the analysis of gene expression data 
is to separate the biological content of the data 
from measurement and sample specific errors. In 
other words, given an observation, i.e. the expression
values of a gene in several samples representing
the same biological variety^1, one wants to infer
the biological state $\sigma$ which generated the
observation. This can be expressed as a conditional
probability

$$P(\sigma|S) \qquad (1)$$

that a gene is in a certain biological state $\sigma$ given
the corresponding observed state $S$.

^1 In the application on which we demonstrate the method
we consider three different varieties: pro, pre, and mature
B-cells. The samples in each variety are different cell lines
arrested at the corresponding stage of development.

In this work we take an information theoretic point
of view to estimate this probability: The information
of interest, the state $\sigma$, is "transmitted"
in a noisy measurement process and potentially
distorted (Figure 1). Using Bayes' theorem, the
desired conditional probability Eq. (1) can be
expressed as:

$$P(\sigma|S) = \frac{P(S|\sigma)\,P(\sigma)}{P(S)} \qquad (2)$$

On the right hand side of this equation, $P(S|\sigma)$ is
the probability to observe state $S$ if the underlying
biological state is $\sigma$. In a sense, $P(S|\sigma)$ describes
the noise characteristic of the measurement process.
In the following we will show how this conditional
probability, and the other probabilities on
the r.h.s. of Eq. (2), can be estimated.

Given a set of $m$ samples representing the same
biological variety, differences in the expression level
of a gene between the samples can arise from two
independent sources:

1. Random variation within the variety. This
   may be caused by temporal differences in
   response to the stimuli, slightly different
   environmental conditions, genotypic differences
   between samples, etc.

2. Sample specific errors. These are mainly caused
   by the measurement process, e.g. differences
   in the treatment of the mRNA, scratched arrays,
   and so on. However, outlier samples,
   cultured under considerably different conditions,
   also contribute to sample specific errors.

A separation of these two contributions is possible
only with an appropriate model for the variation of
gene expression between the samples. In the choice
of model, one has considerable freedom within the
bounds set by biological plausibility. A limiting
factor on the biological model comes from the type
and amount of available data. The data used in
this work contains only four samples for each variety.
For the model we propose this is the minimum
number of samples required to estimate the model
parameters.

In the discretization of gene expression levels, we
use only two discrete values, 0 and 1, for the
expression of a gene in a sample. This means that the
number of observable states $S$ in a variety
consisting of $m$ samples is $N = 2^m$. With no measurement
errors we could immediately infer the underlying
biological state $\sigma$: the two cases where all
observations agree, $S = (1, \ldots, 1)$ and $S = (0, \ldots, 0)$,
can be mapped to the biological states $\sigma_1$ and $\sigma_0$
respectively, which describe "pure" states without
variation. The remaining $N - 2$ observable states $S$,
where the individual measurements disagree,
correspond to biological states $\sigma$ with random variation.
For the application in our biological study
with supposedly identical biological systems
contributing to the observable states $S$, the exact pattern
leading to contradicting observations does not
carry any information, as long as we assume that
there are no sample specific errors. Therefore, we
subsume all $N - 2$ such observations under one
biological state $\sigma_r$ with a random variation.

The biological rationale for this model is given by
the following example: If one considers a biological
variety such as cells in the retina of the eye,
then a certain number of crucial genes ought to
be expressed in all samples. Such genes might include
rhodopsin, a molecule that responds to light.
In contrast, genes such as the hemoglobin family,
which are typical of erythrocytes, ought not to be
expressed in the retina. A third class of genes could
be considered as independent of the system in the
sense that their expression is not directly related
to the biological system. Such genes may vary in
expression both due to environmental and genetic
differences between the samples.

The model discussed so far is depicted graphically
in the left part of Figure 1, where a possible distribution
of the relative frequencies of the three biological
states is depicted, for the case of $m = 4$ samples.
The distribution can be described by three
numbers: the probabilities $P(\sigma_1)$ and $P(\sigma_0)$, which
contribute to the frequencies of the states
$S = (1, 1, 1, 1)$ and $S = (0, 0, 0, 0)$, and $P(\sigma_r)$, which
contributes to both the frequency of mixed states and
the two states above. Describing the mixed states
with only one parameter $P(\sigma_r)$ implies that the
biological variation is modeled as evenly and identically
distributed, independently for each sample. In a
second step, the measurement process with possible
sample specific errors is modeled as statistically
independent between samples. For each sample $i$,
we define two parameters, $P^i_{0\to1}$ and $P^i_{1\to0}$, denoted
sample specific error probabilities: the probability that
a true 0 is observed as 1, and that a true 1 is observed
as 0, respectively.

To introduce the full formalism of our current model
we start by considering a simple example, again
for $m = 4$ samples. An observed state
$S = (S_1, S_2, S_3, S_4) = (1, 0, 1, 0)$ may be generated by
the gene being in state $\sigma_1$ with the probability:

$$P(\sigma_1)\,(1 - P^1_{1\to0})\,(P^2_{1\to0})\,(1 - P^3_{1\to0})\,(P^4_{1\to0}),$$

or it may be generated by the gene being in state
$\sigma_0$ with the probability:

$$P(\sigma_0)\,(P^1_{0\to1})\,(1 - P^2_{0\to1})\,(P^3_{0\to1})\,(1 - P^4_{0\to1}),$$

or it may be generated by the gene being in state
$\sigma_r$ with the probability:

$$P(\sigma_r) \times \tfrac{1}{2}\left[(P^1_{0\to1}) + (1 - P^1_{1\to0})\right] \times \tfrac{1}{2}\left[(1 - P^2_{0\to1}) + (P^2_{1\to0})\right] \times \tfrac{1}{2}\left[(P^3_{0\to1}) + (1 - P^3_{1\to0})\right] \times \tfrac{1}{2}\left[(1 - P^4_{0\to1}) + (P^4_{1\to0})\right].$$

With the briefer notation

$$\tilde{P}^i_{1\to0} \equiv P^i_{1\to0}\,\delta_{S_i,0} + (1 - P^i_{1\to0})\,\delta_{S_i,1},$$

and analogously $\tilde{P}^i_{0\to1} \equiv P^i_{0\to1}\,\delta_{S_i,1} + (1 - P^i_{0\to1})\,\delta_{S_i,0}$,
where $\delta$ refers to the Kronecker delta (i.e. $\delta_{j,k} = 1$
if $j = k$ and 0 otherwise), we may express the
distribution of observed states, in the general case
of binary discretization with $m$ samples, as:

$$P(S) = P(\sigma_1)\prod_{i=1}^{m}\tilde{P}^i_{1\to0} + P(\sigma_0)\prod_{i=1}^{m}\tilde{P}^i_{0\to1} + P(\sigma_r)\prod_{i=1}^{m}\tfrac{1}{2}\left(\tilde{P}^i_{1\to0} + \tilde{P}^i_{0\to1}\right) \qquad (3)$$

Altogether the model uses $3 + 2m$ variables.
These parameters, $P(\sigma_1)$, $P(\sigma_0)$, $P(\sigma_r)$ and
$\{P^i_{1\to0}, P^i_{0\to1}\}_{i=1}^{m}$, are estimated from the observed
distribution of states (right side of Figure 1) by
Levenberg-Marquardt [15] chi-square minimization
of the unweighted error to the theoretical
distribution Eq. (3). Using Eq. (2), and the parameters
estimated as above, our belief that a gene belongs
to the underlying states $\sigma_0$, $\sigma_1$, $\sigma_r$, given the
$2^m = 16$ observable states $S$, can now be expressed
as:

$$P(\sigma_0|S) = \frac{P(S|\sigma_0)\,P(\sigma_0)}{P(S)}, \qquad P(\sigma_1|S) = \frac{P(S|\sigma_1)\,P(\sigma_1)}{P(S)}, \qquad P(\sigma_r|S) = \frac{P(S|\sigma_r)\,P(\sigma_r)}{P(S)}.$$



Once the probability that a gene is in a certain
biological state $\Sigma^i \in \{\sigma_1, \sigma_0, \sigma_r\}$ has been calculated
for all varieties $i = 1 \ldots v$, one can calculate
the probability that a gene exhibits a certain
expression profile over a set of different varieties by
taking the product

$$P(\Sigma^1, \ldots, \Sigma^v | S^1, \ldots, S^v) = \prod_{i=1}^{v} P(\Sigma^i | S^i) \qquad (4)$$

In this way, the probabilistic state analysis also
generates a clustering: For a given expression profile
over the varieties, e.g. $\sigma_0^1 \sigma_1^2 \cdots \sigma_r^v$, we may
extract those genes for which this expression profile
is the most probable. In fact this is a "soft" clustering,
in that a gene can belong to
several clusters simultaneously with different
probabilities. Moreover the genes clustered to a
biologically interesting expression profile can be ranked
by the probability of Eq. (4).
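
The calculation behind Eqs. (2) to (4) can be sketched in Python as follows. This is a minimal illustration under the model assumptions stated above, not the authors' implementation: the function names and the dictionary keys "1", "0" and "r" for the three biological states are choices made here, and the parameters are taken as given (the Levenberg-Marquardt fit is not shown). The numerical values in the usage lines are the synthetic-data settings quoted in the text.

```python
from math import prod


def state_likelihoods(S, p10, p01):
    """Return P(S | sigma) for the three biological states.

    S   : tuple of 0/1 observations, one entry per sample
    p10 : per-sample probabilities that a true 1 is observed as 0
    p01 : per-sample probabilities that a true 0 is observed as 1
    """
    # Probability of each observed bit if the true bit is 1 (resp. 0).
    from_one = [(1 - a) if s == 1 else a for s, a in zip(S, p10)]
    from_zero = [b if s == 1 else (1 - b) for s, b in zip(S, p01)]
    return {
        "1": prod(from_one),
        "0": prod(from_zero),
        # Random state: the true bit is 0 or 1 with probability 1/2 each.
        "r": prod(0.5 * (u + z) for u, z in zip(from_one, from_zero)),
    }


def posteriors(S, prior, p10, p01):
    """Bayes' theorem, Eq. (2); the normalization is P(S) of Eq. (3)."""
    like = state_likelihoods(S, p10, p01)
    joint = {k: like[k] * prior[k] for k in like}
    p_S = sum(joint.values())
    return {k: v / p_S for k, v in joint.items()}


def profile_probability(profile, posts):
    """Eq. (4): product over varieties of the posterior of each state."""
    return prod(post[state] for state, post in zip(profile, posts))


# Synthetic-data settings from the text; four samples per variety.
prior = {"1": 0.35, "0": 0.45, "r": 0.2}
pe = [0.02] * 4

post_pure = posteriors((1, 1, 1, 1), prior, pe, pe)
post_mixed = posteriors((1, 0, 1, 0), prior, pe, pe)
# Belief in the profile "on in the first variety, random in the second".
belief = profile_probability(("1", "r"), [post_pure, post_mixed])
```

With these settings the all-ones observation is attributed mainly to $\sigma_1$ and the mixed observation mainly to $\sigma_r$. Ranking genes by `belief` for a chosen profile corresponds to the soft clustering described above.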



3.2 Experimental data preparation 

All cells were grown in RPMI medium supplemented
with 7.5% fetal calf serum, 10 mM HEPES, 2 mM
pyruvate, 50 mM 2-mercaptoethanol and 50 mg
gentamicin per ml (complete RPMI media) (all
purchased from Life Technologies AB, Täby, Sweden)
at 37 °C and 5% CO2. RNA was prepared
using Trizol (GIBCO) and 7.5 µg of total
RNA was annealed to a T7-oligo T primer by
denaturation at 70 °C for 10 minutes followed
by 10 minutes of incubation of the samples on ice.
First strand synthesis was performed for 2 hours at
42 °C using 20 U of Superscript Reverse Transcriptase
(GIBCO) in buffers and nucleotide mixes
according to the manufacturer's instructions. This
was followed by a second strand synthesis for 2
hours at 16 °C, using RNAseH, E. coli DNA polymerase I
and E. coli DNA ligase (all from GIBCO),
according to the manufacturer's instructions. The
obtained double stranded cDNA was then blunted
by the addition of 20 U of T4 DNA polymerase and
incubation for 5 minutes at 16 °C. The material
was then purified by phenol:chloroform:isoamyl
alcohol extraction followed by precipitation with
NH4Ac and ethanol. The cDNA was then used
in an in vitro transcription reaction for 6 h at 37 °C
using a T7 IVT kit and biotin labeled ribonucleotides.
The obtained cRNA was purified
from unincorporated nucleotides on an RNeasy column
(Qiagen). The eluted cRNA was then fragmented
by incubation of the products for two hours
in fragmentation buffer (40 mM Tris-acetate, pH
8.1, 100 mM KOAc, 150 mM MgOAc). 20 µg of
the final fragmented cRNA was then hybridized to
Affymetrix chip U74Av2 (Affymetrix) in 200 µl
hybridization buffer (100 mM MES-buffer, pH 6.6,
1 M NaCl, 20 mM EDTA, 0.01 herring sperm DNA
(100 µg/ml) and acetylated BSA (500 µg/ml))
in an Affymetrix Gene Chip Hybridization oven
320. The chip was then developed by the addition
of FITC-streptavidin followed by washing using an
Affymetrix Gene Chip Fluidics Station 400. Scanning
was performed using a Hewlett Packard Gene
Array Scanner.



4 Results



To evaluate the method we used both real and synthetic
data. The experimental data was generated
with Affymetrix microarrays for the study of
differentiating murine B-cells at different stages in the
differentiation process. In this publication the data
is only used to demonstrate the feasibility of the
proposed method. The biological implications of
this study are published elsewhere ^U).



4.1 Synthetic data and the effect of correlation

For synthetic data, generated with the model
parameters^2 $P(\sigma_0) = 0.45$, $P(\sigma_1) = 0.35$, $P(\sigma_r) = 0.2$
and $P^i_{1\to0} = P^i_{0\to1} = 0.02$ for all samples $i$,
parameter estimates are, as expected, given with
low errors. This result was verified for sample sizes
$m = 4$, $m = 5$, and $m = 6$ (data not shown).

An assumption of the simple model used to derive Eq. (3)
is that randomly varying genes vary independently
in the samples of a variety. Hence we investigated
how severely this assumption influences the
estimation of the model parameters.

To assess the influence of correlations between randomly
varying genes we generated a data set consisting
of four bits, i.e. samples, with the same parameters
as above. In the random patterns a correlation
was introduced between the third and fourth
bit by changing the value of the fourth to that of
the third with a certain probability. We define this
probability as the correlation factor. The correlation
was introduced before distorting the patterns
with error probabilities. We then plotted the mean
error in the estimation of parameters over 500 runs
of synthetically generated data for correlation factors
in the range $\{0, 0.02, \ldots, 0.98\}$.
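
The generation of synthetic data with a correlation between the third and fourth bit, as described in Section 4.1, can be sketched as follows. This is an illustrative Python sketch under the stated model assumptions, not the code used for the reported results: the function name, the state labels "1", "0" and "r", and the use of Python's `random` module are choices made here.

```python
import random


def synthetic_state(prior, p10, p01, corr, rng):
    """Draw one observed state for a gene (requires at least four samples).

    prior : dict with keys "1", "0", "r" giving the biological state weights
    p10   : per-sample probabilities that a true 1 is observed as 0
    p01   : per-sample probabilities that a true 0 is observed as 1
    corr  : correlation factor between the third and fourth bit
    rng   : a random.Random instance
    """
    m = len(p10)
    kind = rng.choices(["1", "0", "r"],
                       weights=[prior["1"], prior["0"], prior["r"]])[0]
    if kind == "1":
        bits = [1] * m
    elif kind == "0":
        bits = [0] * m
    else:
        # Random pattern: each sample is 0 or 1 with equal probability.
        bits = [rng.randint(0, 1) for _ in range(m)]
        # Correlation is applied to random patterns only, before distortion.
        if rng.random() < corr:
            bits[3] = bits[2]
    # Distort every bit with its sample specific error probability.
    observed = []
    for bit, a, b in zip(bits, p10, p01):
        flip = a if bit == 1 else b
        observed.append(1 - bit if rng.random() < flip else bit)
    return tuple(observed)


rng = random.Random(0)
prior = {"1": 0.35, "0": 0.45, "r": 0.2}
data = [synthetic_state(prior, [0.02] * 4, [0.02] * 4, 0.5, rng)
        for _ in range(1000)]
```

Counting how often each of the 16 states occurs in `data` gives the observed distribution to which the model of Eq. (3) would then be fitted.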



Figure 2 shows the error in the estimation of the
parameters describing the underlying distribution.
We notice that even for fully correlated patterns
the estimation error is less than 20% of the correct
values. The estimation of the probability for
biologically varying genes is somewhat worse: for fully
correlated patterns the error is almost 50%. For
real data one can, however, expect a much smaller
correlation. The average error in the estimates of
the error probabilities, as seen in Figure 3, shows
the expected behavior: The average error grows
with the correlation for the uncorrelated samples,
while the estimate for the correlated observations is
almost unaffected. Intuitively, the model compensates
for the correlation by increasing $P(\sigma_1)$ and
$P(\sigma_0)$ as well as the error probabilities and lowering
$P(\sigma_r)$. For correlation factors above 0.50, due
to the compensation effect, the model deteriorates
in explaining the data. This can be seen in the sum
$P(\sigma_0) + P(\sigma_r) + P(\sigma_1)$, which initially drops from
almost 1 to 0.99 as the correlation factor rises from
0 to 0.50 and then from 0.99 to 0.96 for correlation
factors in the range 0.50 to 0.98 (data not shown).
We hence conclude that it is reasonable not to impose
the condition $P(\sigma_0) + P(\sigma_1) + P(\sigma_r) = 1$ in the
model, as this sum indicates if samples are strongly
correlated in genes whose expression varies around
the threshold of discretization.

^2 These values were chosen as typical values from the
estimates on real data. See next section.

In summary, for not too large correlations in the bi- 
ological variance the algorithm gives a good quan- 
titative estimate of the model parameters. In the 
case of large correlations the qualitative picture 
given by the estimated parameters is still reliable. 



4.2 Real data 

Differentiating B-cells are characterized by phenotypic
markers into different stages of development.
Here we chose to study the expressional differences
between three such stages: pro, pre and mature
B-cells. For each of these three varieties we used
four different cell lines arrested at the corresponding
stage of development. Measurements were performed
on each sample with Affymetrix arrays containing
probesets for 12488 genes and ESTs. The
discretization of expression levels was given by the
Affymetrix GeneChip absent/present calls jO].

Our algorithm was used to estimate the parameters
$P(\sigma_1)$, $P(\sigma_0)$ and $P(\sigma_r)$, describing the
biological distribution, and the error probabilities (see
Table 1). Theoretically, one expects the three
biological probabilities to sum up to one. In our
model, Eq. (3), we do not explicitly impose this
condition. Nevertheless, the sum of the independently
estimated parameters is close to one. This
indicates that our model is a reasonable approximation
of the biological system and the measurement
process.

The error probabilities from Eq. (3) can be used as
a consistency index for the samples in a given variety.
In the last variety (mature B-cells) the maximum
error probability is notably higher. This effect
is likely to be explained by the different anatomical
origins of the cell lines representing this group.
No such differences exist in the other groups since
they all originate in the bone marrow, which is the
only anatomical site for B cell development in the
adult animal ^B]. In contrast, the mature B cell
can reside in several other sites such as spleen,
lymph-nodes and intestine, which may affect the
gene expression profile in these cells. With
only four samples, it is not unlikely that these effects
show up in the error probabilities and not only
in the random variation parameter $P(\sigma_r)$.



4.3 Comparison to conventional 
t-test on known genes 

To determine how well biologically relevant information
can be extracted from the discretized data,
we compare it with another statistical method based
on continuous expression values. We use our method
to identify differences in gene expression between
two varieties in the following way. A gene that goes
up between variety $i$ and variety $j$ is characterized
by the states $\sigma_0^i \sigma_r^j$, or $\sigma_0^i \sigma_1^j$, or $\sigma_r^i \sigma_1^j$. Hence the
belief that a gene goes up is given by the probability^3:

$$P(\text{up between variety } i \text{ and } j) = P(\sigma_0^i)P(\sigma_r^j) + P(\sigma_0^i)P(\sigma_1^j) + P(\sigma_r^i)P(\sigma_1^j)$$

Similarly, the belief that a gene goes down is given
by the probability:

$$P(\text{down between variety } i \text{ and } j) = P(\sigma_1^i)P(\sigma_r^j) + P(\sigma_1^i)P(\sigma_0^j) + P(\sigma_r^i)P(\sigma_0^j)$$

Taking $1 - P(\text{up})$ thus yields the Bayesian p-value
of a gene going up. To answer the same question
when working on continuous expression data one
possibility is to employ a one sided two sample t-test
in the Welch approximation of unknown variances
in the varieties. This enables testing, for each
gene, whether the mean of expression is higher or
lower in one variety than in another. For comparison
of these two approaches we selected a set
of genes based on their well documented expression
pattern and biological functions in the developing
B lymphocyte ^lEl. Several of these are
functionally linked since they participate directly
in somatic DNA rearrangement events occurring
specifically at the pre-B cell stage or participate in
the regulation of genes involved in this process and
thus display restricted expression patterns (pre-B
specific). A second set of genes was selected based
on their expression in cells that are either committed
to the B lineage (B-lineage specific genes, in
pre-B and B-cells) or not committed to this
developmental pathway (Not in B-lineage, expressed in
pro-B cells) [H].

^3 Suppressing the conditional probabilities, $P(\cdot|S)$, for
brevity.
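
The up and down beliefs defined above reduce to sums of three products of posteriors. The following Python sketch shows the arithmetic; the function names, the state labels "0", "r" and "1", and the posterior values in the usage lines are hypothetical and chosen here only for illustration.

```python
def p_up(post_i, post_j):
    """Belief that a gene goes up from variety i to variety j.

    post_i, post_j : dicts with keys "0", "r", "1" holding the posterior
    probabilities P(sigma | S) of the gene in each variety.
    """
    return (post_i["0"] * post_j["r"]
            + post_i["0"] * post_j["1"]
            + post_i["r"] * post_j["1"])


def p_down(post_i, post_j):
    """Belief that a gene goes down from variety i to variety j."""
    return (post_i["1"] * post_j["r"]
            + post_i["1"] * post_j["0"]
            + post_i["r"] * post_j["0"])


# Hypothetical posteriors for one gene in two varieties.
post_i = {"0": 0.90, "r": 0.08, "1": 0.02}
post_j = {"0": 0.01, "r": 0.04, "1": 0.95}

up = p_up(post_i, post_j)        # 0.036 + 0.855 + 0.076
down = p_down(post_i, post_j)    # 0.0008 + 0.0002 + 0.0008
bayesian_p_value_up = 1 - up
```

In this made-up case the belief in "up" is about 0.967, so the hypothesis would be accepted at the 95% level used in Table 2.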

The result of this comparison is presented in Table 2.
For 14 out of the 22 genes the two methods
completely agree. Out of these 14 only one (Mb-1)
does not match the expected target profile. For the
other 8 genes, where the two methods yield different
results, the probabilistic state analysis gives the
expected answer in 5 cases, which should be compared
to the two cases where the t-test gives the
right answer. In one case (rag-1), neither of the
two methods gives the expected result.

For the subset of genes considered here, our method 
has an advantage of 5 : 2 in giving the correct (i.e. 
expected) expression pattern. However, the num- 
ber of samples is not big enough to draw firm con- 
clusions from this result. 



5 Conclusions 

The method we have presented serves several pur- 
poses: 

1. It gives a measure of the biological variation 
of the genes' expression in different varieties. 

2. It estimates each hybridization's global error
probabilities. These parameters are very useful
as they serve as quality/consistency indicators
of the samples of each variety.

3. Given the parameters above, it estimates the
probability of a gene belonging to each of the
three groups $\sigma_0$, $\sigma_r$ and $\sigma_1$. These probabilities
in turn indicate whether the gene is likely
to be below, fluctuating around or above the
threshold of discretization.

4. Clustering of genes to expression profiles over
a set of different varieties is achieved with
Eq. (4). The probability, i.e. belief, that a
gene belongs to a certain cluster is exactly
quantified.



This novel approach proves valuable for quantifying
both data reliability and underlying gene
expression in microarray experiments. Our method
has been successfully applied in two different projects 

EDI- 
7 Figures



[Figure 1 panels: Underlying Distribution (left), Measured Distribution (right)]

Figure 1: Schematic diagram illustrating the transition from underlying to observed distributions of
states, in the case of $m = 4$ samples. The underlying distribution on the left hand side can be described
by the probabilities for each underlying state, $P(\sigma_1)$, $P(\sigma_0)$, and $P(\sigma_r)$ (see text). This distribution
is then distorted by sample specific errors, $P^i_{0\to1}$ and $P^i_{1\to0}$, resulting in an experimentally observed
distribution, depicted on the right hand side.

Figure 2: The average error in the estimation of the parameters $P(\sigma_1)$, $P(\sigma_0)$ and $P(\sigma_r)$ is given as a
function of the correlation factor between the third and fourth bit. For correlation factors above 0.2 the
error in $P(\sigma_r)$ rises considerably.

Figure 3: The average error in the estimation of the sample specific error probabilities. For correlation
factors above 0.2, two of the error probabilities are notably raised. Patterns where these bits deviate from the other
two are then not considered as random but rather as caused by an error. This effect could only be avoided
by introducing extra parameters for correlation between bits.

8 Tables 



8.1 Typical parameter values

               $P(\sigma_1)$   $P(\sigma_0)$   $P(\sigma_r)$   Max $P_e$   Min $P_e$   Median $P_e$
Pro B-cells    0.405           0.460           0.135           0.035       0.0002      0.020
Pre B-cells    0.395           0.450           0.155           0.047       0.003       0.028
Full B-cells   0.343           0.471           0.186           0.073       0.0007      0.022

Table 1: Summary of the estimated parameter values for the B-cell data. $P_e$ refers to the set of error
probabilities, i.e., $\{P^i_{1\to0}, P^i_{0\to1}\}_{i=1}^{4}$.

8.2 t-test vs. probabilistic analysis of gene expression levels

Group                Gene             Probe set    I   II  III    I   II  III
B-lineage specific   Target profile                U   S   U      U   S   U
                     Bob-1            93915_at     U   S   U      U   S   U
                     CD19             99945_at     U   S   U      U   D   U
                     Blnk             100771_at    U   S   U      U   S   U
                     Pax-5            96993_at     U   S   U      U   S   U
                     Blk              92359_at     U   S   U      U   S   U
                     Mb-1             102778_at    U   D   S      U   D   S
                     B29              161012_at    S   S   S      U   S   U
                     CD24             100600_at    U   S   U      U   S   U
Not in B-lineage     Target profile                D   S   D      D   S   D
                     Id-1             100050_at    D   S   D      D   S   D
                     Fag-1            97974_at     S   D   D      D   S   D
                     Il-3 receptor    94747_at     D   S   D      D   S   D
                     CD63             160493_at    D   S   D      D   S   D
                     Gata-2           102789_at    D   S   D      D   D   D

Table 2: The three groups I, II and III indicate the expressional changes from Pro-B to Pre-B, Pre-B
to Mature-B, and Pro-B to Mature-B respectively. U stands for accepting the hypothesis up, D for
down, and S (stable) if no hypothesis could be accepted on the 95% confidence level.